Group 1 Members:

  • David Moste
  • Vinayak Kamath
  • Kenan Sooklall
  • Christian Thieme
  • Lidiia Tronina

Introduction

For this project, we will be working with a mental health dataset obtained from a previously completed ADHD research project. The dataset contains information from 175 individuals who took part in the study and includes 54 columns broken into the following categories:

Excel Column Variable Key
C Sex Male-1, Female-2
D Race White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data -6
E-W ADHD self-report scale Never-0, rarely-1, sometimes-2, often-3, very often-4
X-AM Mood disorder questions No-0, yes-1; question 3: no problem-0, minor-1, moderate-2, serious-3
AN-AS Individual substances misuse no use-0, use-1, abuse-2, dependence-3
AT Court Order No-0, Yes-1
AU Education 1-12 grade, 13+ college
AV History of Violence No-0, Yes-1
AW Disorderly Conduct No-0, Yes-1
AX Suicide attempt No-0, Yes-1
AY Abuse History No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7
AZ Non-substance-related Dx 0 – none; 1 – one; 2 – More than one
BA Substance-related Dx 0 – none; 1 – one Substance-related; 2 – two; 3 – three or more
BB Psychiatric Meds 0 – none; 1 – one psychotropic med; 2 – more than one psychotropic med

Our task is first to explore the data, then using the understanding gained, to build a clustering model to determine different segments of patients. Additionally, we’ll perform Principal Component Analysis on a subset of the dataset to see what information can be gleaned. Finally, we’ll use Gradient Boosting and Support Vector Machines to predict if a patient will commit suicide.

As can be seen from the table above, the dataset has a wide variety of numeric and categorical data. We’ll dive deeper into this data now.

Exploratory Data Analysis

Let’s begin exploring by taking a high level look at our dataset:

glimpse(adhd)
## Rows: 175
## Columns: 54
## $ Initial              <chr> "JA", "LA", "MD", "RD", "RB", "SB", "PK", "R…
## $ Age                  <dbl> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, …
## $ Sex                  <dbl> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2,…
## $ Race                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `ADHD Q1`            <dbl> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1,…
## $ `ADHD Q2`            <dbl> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0,…
## $ `ADHD Q3`            <dbl> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1,…
## $ `ADHD Q4`            <dbl> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2,…
## $ `ADHD Q5`            <dbl> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2,…
## $ `ADHD Q6`            <dbl> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3,…
## $ `ADHD Q7`            <dbl> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0,…
## $ `ADHD Q8`            <dbl> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0,…
## $ `ADHD Q9`            <dbl> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2,…
## $ `ADHD Q10`           <dbl> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2,…
## $ `ADHD Q11`           <dbl> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1,…
## $ `ADHD Q12`           <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ `ADHD Q13`           <dbl> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2,…
## $ `ADHD Q14`           <dbl> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3,…
## $ `ADHD Q15`           <dbl> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1,…
## $ `ADHD Q16`           <dbl> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1,…
## $ `ADHD Q17`           <dbl> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3,…
## $ `ADHD Q18`           <dbl> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1,…
## $ `ADHD Total`         <dbl> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, …
## $ `MD Q1a`             <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1b`             <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ `MD Q1c`             <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,…
## $ `MD Q1d`             <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
## $ `MD Q1e`             <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ `MD Q1f`             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1g`             <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ `MD Q1h`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1i`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1j`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ `MD Q1k`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1L`             <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1m`             <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ `MD Q2`              <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ `MD Q3`              <dbl> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2,…
## $ `MD TOTAL`           <dbl> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 1…
## $ Alcohol              <dbl> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3,…
## $ THC                  <dbl> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1,…
## $ Cocaine              <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Stimulants           <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `Sedative-hypnotics` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Opioids              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0,…
## $ `Court order`        <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ Education            <dbl> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18…
## $ `Hx of Violence`     <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,…
## $ `Disorderly Conduct` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ Suicide              <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Abuse                <dbl> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0,…
## $ `Non-subst Dx`       <dbl> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0,…
## $ `Subst Dx`           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1,…
## $ `Psych meds.`        <dbl> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0,…

We can see that there is a mix of both numeric and categorical data. We note that most of the numerical features are actually binary categorical variables. Before going too far, let’s change the data type of these variables so we can explore our data properly:

adhd <- adhd %>% mutate_if(is.double, as.factor)
adhd$Age <- as.integer(adhd$Age)
inspectdf::inspect_types(adhd) %>%
  show_plot

Having made the necessary changes, we can see in the visual above, that our dataset is mostly categorical data. However, there is one column that is of type character and another column that is of type integer.

Numerical Features

table <- describe(adhd$Age)[,c(2,8,3,5,9,4)]
rownames(table) <-  c('Age')
knitr::kable(table) %>%
  kable_styling()
n min mean median max sd
Age 175 1 22.36571 25 42 10.94936

Looking at the summary above, we can see that the age of the subjects of this research study are between 1 and 42 with the median value being 25. Let’s now take a look at the shape of our distribution and see if there are significant outliers:

inspectdf::inspect_num(adhd) %>% 
 show_plot()

In looking at the distribution of age, it appears that our dataset is multi-modal. This often means that there are sub-populations within the data. We’ll keep this in mind as we continue our exploration. We do not see any significant outliers.

Later, we will be building models to attempt to predict individuals who have committed suicide. As such, we’ll look at the distribution of Age broken down by whether an individual has attempted suicide or not:

ggplot(adhd %>% filter(Suicide %in% c(1,0)) ) + 
  aes( y = Age, fill = Suicide) + 
  geom_histogram() +
  coord_flip()

Based on the histogram above, we do not note any significant differences in the distributions. However, this may be more easily determined by looking at a boxplot.

 ggplot(adhd %>% filter(Suicide %in% c(1,0))) + 
       aes(x = Suicide, y = Age) + 
       geom_boxplot(color = 'steelblue', 
                    outlier.color = 'firebrick', 
                    outlier.alpha = 0.35) +
        labs(title = 'Suicide vs Loan_Status', y = 'Age', x= 'Suicide') +
        theme_minimal() + 
        theme(
          plot.title = element_text(hjust = 0.45),
          panel.grid.major.y =  element_line(color = "grey", 
                                             linetype = "dashed"),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.y = element_blank(),
          panel.grid.minor.x = element_blank(),
          axis.ticks.x = element_line(color = "grey")
        )

In looking at the boxplots above, we can see that it does appear that, in general, those who attempted suicide were younger than those who did not.

Categorical Features

Now let’s turn our attention to our categorical variables. We have both binary categorical variables, and variables with 3 or more classes. Looking at this data in a table would be unwieldy due to its size, so we’ll visualize this data instead.

inspect_cat(adhd) %>% 
  show_plot()

In looking at the summary table above, we note:

  • Abuse: Nearly half of our population has been subject to some type of abuse
  • ADHD Q1-18: There is a fairly equal dispersion between each of these responses
  • Alcohol: Alcohol appears to be used by half of our population
  • Cocaine: About a 1/3 of the population uses cocaine
  • Court Order: Majority of the population does not have a court order
  • Disorderly Conduct: Over half of the population has had disorderly conduct
  • Education: A very small percentage of population has attended college
  • Hx of Violence: Subjects with a history of violence is the minority of the population
  • MDQ1-MDQ3: There is a fairly even breakout in response for each of these questions
  • Non-subst Dx: We note the presence of quite a few missing values in this column. Most individuals do not have a non-substance prescription
  • Opiods: Most subjects do not have opiods
  • Psych meds.: The majority of this column is missing data
  • Race: The majority of the population is either white or African American
  • Sedative-hypnotics: Hardly anyone in the population is using sedative-hypnotics
  • Sex: The population is a fairly even split between males and females
  • Stimulants: Stimulant usage among the participants is very minimal
  • Subst Dx: A fairly even split between each of these categories. We note the presence of quite a few missing values in this column.
  • Suicide: It appears that around around a 1/4 of the population has attempted suicide
  • THC: THC only affects around 1/3 of our population

Again, we will be building models to predict who has attempted to commit suicide, so it will also be helpful to understand the the distributions of these variables as it relates to our Suicide variable, both Yes and No. As mentioned above, only 1/4 of the population has attempted suicide, so we expect counts for those who have attempted suicide to be considerably lower than those who have not, however, we perform this exercise to see anything major sticks out as a red flag.

Based on the charts above, we do not note anything we would consider out of the ordinary where suicide attempts outnumber those who have not attempted suicide for any of the variables. In fact, the distributions look fairly similar given our understanding of the percentage of those who have attempted suicide.

Missing Data

As we’ve seen throughout out EDA, there are quite a few missing values within the columns. Let’s take a closer look at this so we can determine the best way to deal with them.

visdat::vis_miss(adhd, sort_miss = TRUE)

It appears that about 2.6% of our dataset is missing values and as previously mentioned, Psych meds. makes up a large portion of that. Because of this, there would be no good way to impute these values and so we have decided to drop this variable from the dataset.

adhd <- adhd %>% 
  select(-`Psych meds.` )

Further, we see a pattern within the missing data that many of the respondents who have missing data about a suicide attempt, also have additional missing data. As our objective is to predict suicide attempts, we will drop those subjects who have missing data for that variable since the data will be unusable when it comes to our modeling.

adhd <- adhd %>% 
  filter(!is.na(Suicide))

Having adjusted these, let’s now look at our missing data again.

visdat::vis_miss(adhd, sort_miss = TRUE)

We can see clearly that these adjustments have made a significant impact on the dataset. There is now only 0.4% of our dataset that is missing. We can now use an imputation method to fill in the remainder of the missing values.

We have chosen to use the pmm method (predictive mean matching) from the mice library to impute our missing values. Predictive mean matching calculates the predicted value for our target variable, and, for missing values, forms a small set of “candidate donors” from the complete cases that are closest to the predicted value for our missing entry. Donors are then randomly chosen from candidates and imputed where values were once missing. To apply pmm we assume that the distribution is the same for missing cells as it is for observed data, and thus, the approach may be more limited when the % of missing values is higher.

In order to apply this method, we’ll need to adjust our column names that have spacing. We can use the clean_names function from the janitor library to do this for us. Once this is completed, we’ll impute our missing values:

adhd <- adhd %>% 
  janitor::clean_names()
adhd <- mice(data = adhd, m = 1, method = "pmm", seed = 500)
## 
##  iter imp variable
##   1   1  education*  abuse*  non_subst_dx*  subst_dx*
##   2   1  education*  abuse*  non_subst_dx*  subst_dx*
##   3   1  education*  abuse*  non_subst_dx*  subst_dx*
##   4   1  education*  abuse*  non_subst_dx*  subst_dx*
##   5   1  education*  abuse*  non_subst_dx*  subst_dx*
##  * Please inspect the loggedEvents
## Warning: Number of logged events: 45
adhd <- mice::complete(adhd, 1)

We can see now that we have successfully imputed all missing values:

visdat::vis_miss(adhd, sort_miss = TRUE)

Modeling

K-Means Clustering

K-Means clustering is a technique of creating clusters by finding groups that are similar, as determined by euclidean distance. This method has many advantages (computationally efficient) and disadvantages (results change with every run, dependent on the starting values, no real way to determine appropriate number of clusters), but we ultimately chose to use k-means instead of hierarchical clustering due to there being no real reason to believe this data has a hierarchical structure to it.

The first step was to pull out only the columns we wanted to use. Since k-means is based on euclidean distance, we chose to only use variables that had continuous values, ruling out all binary and factor based predictors. Some of this data came in as factors (despite it being continuous), so we had to convert those to numeric values.

library(tidyverse)
adhd_km <- adhd %>%
  select(age, md_total, education, adhd_total)
adhd_km <- adhd_km %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

The next step was to transform the data. Again, the euclidean based distance measurement calls for us to center and scale the data. We also chose to perform a Box-Cox transformation to help reduce skewness.

library(caret)
adhd_km_trans <- preProcess(adhd_km,
                            method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)

Now that we have our data all set up, we could go ahead and get a sense of how many clusters to create. To do this, we used the silhouette method. When the average silhouette width starts to decrease, you stop gaining information by dividing clusters. This led us to an optimal number of 2 cluster.

library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4

Now we finally get to create and visualize our clusters! To visualize these four dimensional clusters, we used PCA to reduce the dimensionality of our data.

library(factoextra)
km.res <- kmeans(adhd_km_trans, centers = 2, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)

What incredible clusters! We wanted to get a better sense of the differences between these clusters and what made them unique, so we went back to the original data and performed a little bit of analysis. We computed the mean for each factor within each cluster and plotted these values on a graph to see the differences in the two clusters more clearly.

library(tidyverse)
adhd_km_res <- cbind(adhd_km, km.res$cluster)
km_comparison <- adhd_km_res %>%
  group_by(km.res$cluster) %>%
  rename(cluster = `km.res$cluster`) %>%
  summarise_all(mean) %>%
  pivot_longer(c("age":"adhd_total"),
               names_to = "category",
               values_to = "mean")
ggplot(km_comparison,
       aes(x = as.factor(cluster),
           y = mean,
           fill = as.factor(category))) +
  geom_bar(position = "dodge",
           stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Cluster",
       y = "Value",
       fill = "Factor")

By looking at these two clusters, we can see that one of our clusters has a high total ADHD and MD score while also having a lower age and education. This would indicate that our two groups are the following: Younger and less educated with high ADHD and MD scores Older and more educated with lower ADHD and MD scores

Out of curiosity, we wanted to add back in a single binary variable - suicide - since we knew we would be investigating it later. We knew k-means would struggle with this, but we were curious to see how much it would impact the outcome.

We went through the same steps as before, but this time included suicide as our first variable.

library(tidyverse)
library(caret)
library(factoextra)
adhd_km <- adhd %>%
  select(suicide, age, md_total, education, adhd_total)
adhd_km <- adhd_km %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)
adhd_km_trans <- preProcess(adhd_km,
                            method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)

km.res <- kmeans(adhd_km_trans, centers = 12, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)

adhd_km_res <- cbind(adhd_km, km.res$cluster)
km_comparison <- adhd_km_res %>%
  group_by(km.res$cluster) %>%
  rename(cluster = `km.res$cluster`) %>%
  summarise_all(mean) %>%
  pivot_longer(c("suicide":"adhd_total"),
               names_to = "category",
               values_to = "mean")
ggplot(km_comparison,
       aes(x = as.factor(cluster),
           y = mean,
           fill = as.factor(category))) +
  geom_bar(position = "dodge",
           stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Cluster",
       y = "Value",
       fill = "Factor")

Adding in just one binary variable - suicide - had a massive impact on our clustering. The optimal number of clusters increased by 600% all the way up to 12 clusters. All of a sudden we have clusters that fit a bunch of different scenarios. Based on our clusters, we ended up with the following conclusions: * We had 4 groups that had non-zero means for suicide (meaning suicide was attempted) * Of these 4 groups, 3 had higher ADHD scores, 1 was an older group and 1 was a younger group, and education for all 4 groups was low.

We’re curious if the massive increase in the optimal number of clusters was caused by the addition of a single variable or the addition of a binary variable. The best way to tell would be to replace the binary suicide variable with another continuous variable, but we already used all of the provided continuous variables. We could always remove one of our original variables and retain the binary suicide variable, but even this would cause some level of distortion.

Gradient Boosting

We begin by dropping the initial column as that may not carry any predictive information. The data is then split into training and testing

df_split = initial_split(adhd %>% select(-c(initial)), strata=suicide)
df_train <- training(df_split)
df_test <- testing(df_split)

Gradient boosting machines are similar to random forests or bagging approaches, but instead of just growing a large number of trees from independent random samples of the data, we grow them sequentially on transformations of the data. Boosting is a method to improve (boost) the week learners sequentially and increase the model accuracy with a combined model.

After we tried most combinations of the different metrics, we used gbm function setting parameters the following way:

  • the number of trees to 1000, as the minimum number of trees used in the model ensemble is 100. And setting it too high would cause our model to overfit.

  • interaction depth (number of splits it has to perform on a tree) to 4.

  • distribution type to “adaboost”, as we are working with 0-1 outcomes

modelGBM=gbm(as.character(suicide)~., data=data.frame(df_train)
             ,n.trees=1000,distribution='adaboost',interaction.depth=4,shrinkage=0.01)
pgbm=predict(modelGBM,newdata=data.frame(df_test),n.trees = 1000,type='response')
pgbm[pgbm>0.5]=1
pgbm[pgbm<=0.5]=0
summary(modelGBM,
  cBars = 10,
  n.trees = modelGBM$n.trees,
  plotit = TRUE,
  order = TRUE,
  method = relative.influence,
  normalize = TRUE)

##                                   var      rel.inf
## adhd_total                 adhd_total 8.286190e+01
## md_total                     md_total 7.902049e+00
## education                   education 1.182468e+00
## abuse                           abuse 1.167775e+00
## adhd_q14                     adhd_q14 7.516594e-01
## adhd_q15                     adhd_q15 6.714416e-01
## adhd_q1                       adhd_q1 6.095127e-01
## adhd_q11                     adhd_q11 5.135936e-01
## adhd_q10                     adhd_q10 4.551679e-01
## adhd_q4                       adhd_q4 4.135598e-01
## race                             race 3.706968e-01
## alcohol                       alcohol 2.724549e-01
## adhd_q17                     adhd_q17 2.451256e-01
## adhd_q7                       adhd_q7 2.378631e-01
## adhd_q18                     adhd_q18 2.122014e-01
## age                               age 1.905235e-01
## adhd_q5                       adhd_q5 1.900748e-01
## adhd_q16                     adhd_q16 1.851144e-01
## adhd_q9                       adhd_q9 1.814930e-01
## subst_dx                     subst_dx 1.592262e-01
## sex                               sex 1.436468e-01
## adhd_q12                     adhd_q12 1.393376e-01
## adhd_q3                       adhd_q3 1.331517e-01
## adhd_q6                       adhd_q6 1.040196e-01
## adhd_q8                       adhd_q8 1.037221e-01
## adhd_q2                       adhd_q2 1.028580e-01
## adhd_q13                     adhd_q13 8.494249e-02
## non_subst_dx             non_subst_dx 8.409667e-02
## md_q3                           md_q3 8.317285e-02
## cocaine                       cocaine 8.053583e-02
## md_q1m                         md_q1m 7.031097e-02
## hx_of_violence         hx_of_violence 4.110973e-02
## thc                               thc 2.073343e-02
## md_q1d                         md_q1d 7.304742e-03
## md_q1j                         md_q1j 7.274572e-03
## disorderly_conduct disorderly_conduct 6.218441e-03
## md_q1a                         md_q1a 3.724065e-03
## md_q1k                         md_q1k 2.639621e-03
## md_q1g                         md_q1g 2.542308e-03
## md_q1e                         md_q1e 2.400188e-03
## md_q1i                         md_q1i 1.953731e-03
## md_q1c                         md_q1c 1.711626e-04
## md_q1h                         md_q1h 1.673989e-04
## md_q1l                         md_q1l 6.609901e-05
## md_q2                           md_q2 7.590723e-07
## md_q1b                         md_q1b 0.000000e+00
## md_q1f                         md_q1f 0.000000e+00
## stimulants                 stimulants 0.000000e+00
## sedative_hypnotics sedative_hypnotics 0.000000e+00
## opioids                       opioids 0.000000e+00
## court_order               court_order 0.000000e+00

The summary function gave us a variable importance plot. adhd_totaland md_total are the most important variables in the model.

confusionMatrix(as.factor(df_test$suicide),as.factor(pgbm))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  3
##          1  8  5
##                                           
##                Accuracy : 0.7381          
##                  95% CI : (0.5796, 0.8614)
##     No Information Rate : 0.8095          
##     P-Value [Acc > NIR] : 0.9112          
##                                           
##                   Kappa : 0.3145          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.7647          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.8966          
##          Neg Pred Value : 0.3846          
##              Prevalence : 0.8095          
##          Detection Rate : 0.6190          
##    Detection Prevalence : 0.6905          
##       Balanced Accuracy : 0.6949          
##                                           
##        'Positive' Class : 0               
## 

It appears that Boosting has a correct classification rate of 74%. We’ll use this as a benchmark as we move forward to an SVM method.

Support Vector Machine

A Support Vector Machine (SVM) is an algorithm that searches the feature space for the optimal hyper plane. This hyper plane will separate the features by classes with the maximum margin. Here we train an SVM to find the dividing plane between those who commit suicide and those who don’t base on the features we have.

The trained dataset is fit to an SVM

set.seed(42)
svm_model = svm(suicide~., data=df_train, kernel='linear', probability=TRUE)
summary(svm_model)
## 
## Call:
## svm(formula = suicide ~ ., data = df_train, kernel = "linear", 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  81
## 
##  ( 50 31 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Our base model consists of 78 support vectors where 45 are assign to label 0 (no suicide) and 33 to label 1 (suicide)

We will tune the svm with the training set to find the best gamma and cost. The tuning process is done with 10 fold cross validation.

svm_tune <- tune.svm(suicide~., data = df_train, gamma = 0.25, cost = seq(0.1,1,0.1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##   0.25  0.1
## 
## - best performance: 0.3 
## 
## - Detailed performance results:
##    gamma cost error dispersion
## 1   0.25  0.1   0.3  0.1191534
## 2   0.25  0.2   0.3  0.1191534
## 3   0.25  0.3   0.3  0.1191534
## 4   0.25  0.4   0.3  0.1191534
## 5   0.25  0.5   0.3  0.1191534
## 6   0.25  0.6   0.3  0.1191534
## 7   0.25  0.7   0.3  0.1191534
## 8   0.25  0.8   0.3  0.1191534
## 9   0.25  0.9   0.3  0.1191534
## 10  0.25  1.0   0.3  0.1191534
svm_best <- svm_tune$best.model

It looks like the best parameters are gamma = 0.25 and cost = 0.1

## 
## Call:
## svm(formula = suicide ~ ., data = df_train, cost = 0.1, gamma = 0.25, 
##     kernel = "linear", probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  88
## 
##  ( 52 36 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Now with the best model we can test it aganist the train and test set

y_train = predict(svm_best, df_train %>% select(-c(suicide)))
table(df_train[,'suicide'], y_train)
##    y_train
##      0  1
##   0 84  0
##   1  5 31
y_test = predict(svm_best, df_test %>% select(-c(suicide)))
table(df_test[,'suicide'], y_test)
##    y_test
##      0  1
##   0 24  5
##   1  6  7
confusionMatrix(df_test[,'suicide'], y_test, positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  5
##          1  6  7
##                                           
##                Accuracy : 0.7381          
##                  95% CI : (0.5796, 0.8614)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.4418          
##                                           
##                   Kappa : 0.374           
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.5385          
##          Neg Pred Value : 0.8276          
##              Prevalence : 0.2857          
##          Detection Rate : 0.1667          
##    Detection Prevalence : 0.3095          
##       Balanced Accuracy : 0.6917          
##                                           
##        'Positive' Class : 1               
## 

As we can see the SVM model has 4 False Positive on the training set. However it makes more mistakes on the testing set, which is to be expected.

pred <- predict(svm_best, df_test, decision.values = TRUE, probability = TRUE)
pred_prob <- attr(pred, 'probabilities')
df_prob <- cbind(df_test %>% select(suicide), y_test, pred_prob)
names(df_prob) = c('y_true' ,'y_pred', 'negative', 'positive')
summary(df_prob)
##  y_true y_pred    negative         positive     
##  0:29   0:30   Min.   :0.6404   Min.   :0.2302  
##  1:13   1:12   1st Qu.:0.6717   1st Qu.:0.2822  
##                Median :0.6935   Median :0.3065  
##                Mean   :0.6954   Mean   :0.3046  
##                3rd Qu.:0.7178   3rd Qu.:0.3283  
##                Max.   :0.7698   Max.   :0.3596

We can see from the predicted probabilities the SVM model is more likely to predict that an individual will not commit suicide.


PCA

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.

(Source : https://en.wikipedia.org/wiki/Principal_component_analysis)

We’ll use the prcomp() function to reduce the number of dimensions on our entire dataset and measure the amount of variance explained is beneficial: As mentioned PCA works best with numerical data we will neglect the categorical variable Initials. We are now left with a matrix of 52 columns and 150+ rows which we will pass through prcomp( ) function for the principal component analysis.

This function returns the results as an object of class ‘prcomp’. We will assign the output to a variable named pca.out

For our PCA analysis we will concentrate on the Mood disorders questions. As the mood swings seems to be a strong reason for suicide attempts.

#working on  copy for pca analysis
adhd.pca <- data.frame(adhd)

#droping character column for PCA analysis.
adhd.pca <- adhd.pca[,-1]

#droping ADHD disorders questionnaire column for PCA analysis.
adhd.pca <- select(adhd.pca , -c(adhd_q1,adhd_q2, adhd_q3, adhd_q4, adhd_q5, adhd_q6, adhd_q7
                                 , adhd_q8, adhd_q9, adhd_q10, adhd_q11, adhd_q12
                                 , adhd_q13, adhd_q14, adhd_q15, adhd_q16, adhd_q17
                                 , adhd_q18, adhd_total ))

#converting columns to numeric from factor ofr PCA analysi
adhd.pca$sex <- as.numeric(adhd.pca$sex)
adhd.pca$age <- as.numeric(adhd.pca$age)
adhd.pca$race <- as.numeric(adhd.pca$race)
#adhd.pca$adhd_q1 <- as.numeric(adhd.pca$adhd_q1)
#adhd.pca$adhd_q2 <- as.numeric(adhd.pca$adhd_q2)
#adhd.pca$adhd_q3 <- as.numeric(adhd.pca$adhd_q3)
#adhd.pca$adhd_q4 <- as.numeric(adhd.pca$adhd_q4)
#adhd.pca$adhd_q5 <- as.numeric(adhd.pca$adhd_q5)
#adhd.pca$adhd_q6 <- as.numeric(adhd.pca$adhd_q6)
#adhd.pca$adhd_q7 <- as.numeric(adhd.pca$adhd_q7)
#adhd.pca$adhd_q8 <- as.numeric(adhd.pca$adhd_q8)
#adhd.pca$adhd_q9 <- as.numeric(adhd.pca$adhd_q9)
#adhd.pca$adhd_q10 <- as.numeric(adhd.pca$adhd_q10)
#adhd.pca$adhd_q11 <- as.numeric(adhd.pca$adhd_q11)
#adhd.pca$adhd_q12 <- as.numeric(adhd.pca$adhd_q12)
#adhd.pca$adhd_q13 <- as.numeric(adhd.pca$adhd_q13)
#adhd.pca$adhd_q14 <- as.numeric(adhd.pca$adhd_q14)
#adhd.pca$adhd_q15 <- as.numeric(adhd.pca$adhd_q15)
#adhd.pca$adhd_q16 <- as.numeric(adhd.pca$adhd_q16)
#adhd.pca$adhd_q17 <- as.numeric(adhd.pca$adhd_q17)
#adhd.pca$adhd_q18 <- as.numeric(adhd.pca$adhd_q18)
#adhd.pca$adhd_total <- as.numeric(adhd.pca$adhd_total)
adhd.pca$md_q1a <- as.numeric(adhd.pca$md_q1a)
adhd.pca$md_q1b <- as.numeric(adhd.pca$md_q1b)
adhd.pca$md_q1c <- as.numeric(adhd.pca$md_q1c)
adhd.pca$md_q1d <- as.numeric(adhd.pca$md_q1d)
adhd.pca$md_q1e <- as.numeric(adhd.pca$md_q1e)
adhd.pca$md_q1f <- as.numeric(adhd.pca$md_q1f)
adhd.pca$md_q1g <- as.numeric(adhd.pca$md_q1g)
adhd.pca$md_q1h <- as.numeric(adhd.pca$md_q1h)
adhd.pca$md_q1i <- as.numeric(adhd.pca$md_q1i)
adhd.pca$md_q1j <- as.numeric(adhd.pca$md_q1j)
adhd.pca$md_q1k <- as.numeric(adhd.pca$md_q1k)
adhd.pca$md_q1l <- as.numeric(adhd.pca$md_q1l)
adhd.pca$md_q1m <- as.numeric(adhd.pca$md_q1m)
adhd.pca$md_q2 <- as.numeric(adhd.pca$md_q2)
adhd.pca$md_q3 <- as.numeric(adhd.pca$md_q3)
adhd.pca$md_total <- as.numeric(adhd.pca$md_total)
adhd.pca$alcohol <- as.numeric(adhd.pca$alcohol)
adhd.pca$thc <- as.numeric(adhd.pca$thc)
adhd.pca$cocaine <- as.numeric(adhd.pca$cocaine)
adhd.pca$stimulants <- as.numeric(adhd.pca$stimulants)
adhd.pca$sedative_hypnotics <- as.numeric(adhd.pca$sedative_hypnotics)
adhd.pca$opioids <- as.numeric(adhd.pca$opioids)
adhd.pca$court_order <- as.numeric(adhd.pca$court_order)
adhd.pca$education <- as.numeric(adhd.pca$education)
adhd.pca$hx_of_violence <- as.numeric(adhd.pca$education)
adhd.pca$disorderly_conduct <- as.numeric(adhd.pca$disorderly_conduct)
adhd.pca$suicide <- as.numeric(adhd.pca$suicide)
adhd.pca$abuse <- as.numeric(adhd.pca$abuse)
adhd.pca$non_subst_dx <- as.numeric(adhd.pca$non_subst_dx)
adhd.pca$subst_dx <- as.numeric(adhd.pca$subst_dx)

#prcomp function
pca.out = prcomp(adhd.pca, scale= TRUE )
pca.out
## Standard deviations (1, .., p=33):
##  [1] 2.649866e+00 1.667582e+00 1.517793e+00 1.410442e+00 1.317541e+00
##  [6] 1.234053e+00 1.203317e+00 1.107268e+00 1.053497e+00 1.012235e+00
## [11] 9.818435e-01 9.567827e-01 9.126114e-01 8.790474e-01 8.642734e-01
## [16] 8.389314e-01 8.141526e-01 7.718777e-01 7.511017e-01 7.202164e-01
## [21] 6.961629e-01 6.542946e-01 6.371036e-01 6.321703e-01 5.901505e-01
## [26] 5.827049e-01 5.224433e-01 5.091218e-01 4.918273e-01 4.603005e-01
## [31] 4.076608e-01 3.891165e-02 4.072422e-16
## 
## Rotation (n x k) = (33 x 33):
##                            PC1         PC2         PC3         PC4
## age                 0.03469272 -0.02795421  0.10612354 -0.27323774
## sex                -0.03529444 -0.22687351 -0.08987261  0.06522574
## race                0.06586145  0.28566259  0.22420588 -0.19317186
## md_q1a             -0.27044583  0.02843719 -0.09549368 -0.07566968
## md_q1b             -0.24862956 -0.17445100 -0.06222698 -0.07295610
## md_q1c             -0.15515216  0.26266927  0.21383572 -0.02629291
## md_q1d             -0.21969524 -0.01139197  0.05523678 -0.07662104
## md_q1e             -0.23075163  0.04234849  0.14067828 -0.05575442
## md_q1f             -0.24413027 -0.17143952  0.02798320 -0.02055926
## md_q1g             -0.24696673 -0.20604513  0.02006128 -0.10716428
## md_q1h             -0.22245811  0.14788474  0.18999798  0.11320563
## md_q1i             -0.21574050  0.16426516  0.15323829  0.03519754
## md_q1j             -0.21277108  0.12077625  0.18128584  0.04417958
## md_q1k             -0.20291342  0.17075271  0.07676026  0.08296841
## md_q1l             -0.26635638  0.01470877 -0.05531620  0.03779050
## md_q1m             -0.21100877  0.05281260  0.01556825  0.02569672
## md_q2              -0.26244396 -0.16156199  0.13637663  0.03491599
## md_q3              -0.18692853 -0.25054736 -0.09241663 -0.07575819
## md_total           -0.37077958 -0.02033346  0.08336609 -0.02175389
## alcohol            -0.12004834  0.08896461 -0.09911649  0.13433274
## thc                -0.02050055  0.19066834 -0.19793178  0.15824150
## cocaine            -0.04517386  0.36426575 -0.06055117 -0.08625893
## stimulants         -0.03777763 -0.02694936 -0.11917140  0.19607340
## sedative_hypnotics -0.01304696 -0.05938442 -0.12763572  0.35183905
## opioids            -0.05247028 -0.11539783 -0.14156431  0.38222201
## court_order        -0.04106587  0.02931041 -0.15043748  0.07385034
## education           0.10651109 -0.12501654  0.44105813  0.37698186
## hx_of_violence      0.10651109 -0.12501654  0.44105813  0.37698186
## disorderly_conduct -0.06230218  0.21697473 -0.26654149  0.04419898
## suicide            -0.13273367 -0.15471925 -0.16679636  0.15351100
## abuse              -0.04040704 -0.15737653 -0.22772880  0.01583052
## non_subst_dx        0.01155786 -0.32267256  0.05776974 -0.13821692
## subst_dx           -0.06300744  0.28407666 -0.20726604  0.35400797
##                            PC5          PC6         PC7         PC8
## age                 0.47278164 -0.270578426  0.19775076 -0.04330043
## sex                 0.22789549  0.422738904  0.10345553  0.35686128
## race               -0.09637874  0.111913168  0.02023633  0.28232588
## md_q1a             -0.03782752 -0.126405201  0.01744824 -0.01413718
## md_q1b             -0.16543807 -0.014347712 -0.12662419  0.17346536
## md_q1c              0.09490764 -0.040626756 -0.01941350  0.06331634
## md_q1d              0.05512104 -0.050769426  0.11001808  0.02195578
## md_q1e              0.14622817  0.087617400  0.13069549  0.15040371
## md_q1f             -0.06466367 -0.023417435  0.04733737  0.07619921
## md_q1g             -0.08725510  0.001101435  0.03903264  0.11720881
## md_q1h              0.06245039  0.271122543 -0.04237007 -0.04853672
## md_q1i              0.14250045  0.191593915  0.01207008 -0.08599415
## md_q1j              0.03831358  0.293985344 -0.04203836 -0.21677885
## md_q1k             -0.07537880  0.174498227 -0.05990734 -0.30203480
## md_q1l             -0.07001815 -0.273978098 -0.11545942 -0.06367887
## md_q1m              0.12188641 -0.282952915 -0.09231209 -0.03952059
## md_q2              -0.19175816 -0.117596274 -0.00373041  0.16279108
## md_q3              -0.05749273 -0.112697053 -0.01660499  0.12651368
## md_total           -0.01815860 -0.016876427 -0.02703472  0.02545512
## alcohol             0.15374251 -0.243420922  0.31313585 -0.27517669
## thc                -0.34744264  0.083090024  0.26117533  0.31251349
## cocaine             0.21699906  0.025545086  0.17392878  0.18916273
## stimulants         -0.19337073 -0.041896184  0.15751462  0.04026151
## sedative_hypnotics  0.22457658  0.059213989 -0.37050617 -0.07201711
## opioids             0.21786394  0.029460240 -0.26879342  0.04486552
## court_order        -0.23985533  0.271707256  0.35087544 -0.30921483
## education          -0.04645289 -0.158187822  0.21367705  0.05999551
## hx_of_violence     -0.04645289 -0.158187822  0.21367705  0.05999551
## disorderly_conduct -0.14475021 -0.235385249  0.13374782 -0.12281490
## suicide             0.08924076  0.119062079  0.15529349 -0.15570934
## abuse               0.32599073  0.060288594  0.33683599  0.16236720
## non_subst_dx        0.04779839  0.130844302  0.25223430 -0.29076332
## subst_dx            0.15380509 -0.103606673  0.07469640  0.21527276
##                             PC9          PC10         PC11         PC12
## age                 0.133176264 -3.052525e-02 -0.015951238 -0.083591933
## sex                 0.039997480  1.466959e-01 -0.023498922 -0.100533619
## race               -0.386907599 -7.643193e-02  0.126265212 -0.072983447
## md_q1a              0.032414083  7.127731e-02  0.180503066  0.190896260
## md_q1b             -0.105960674 -1.246399e-01 -0.134001492 -0.007910102
## md_q1c              0.094016163  1.737132e-02 -0.080067316 -0.099504904
## md_q1d              0.025009366 -2.353973e-01  0.364118655 -0.393120078
## md_q1e              0.102473263  2.123074e-01  0.168033281  0.151930884
## md_q1f              0.131852140 -1.103187e-01 -0.059881380  0.015404054
## md_q1g             -0.005209731 -1.098592e-01 -0.025440045 -0.003290307
## md_q1h              0.086040431 -2.642789e-03  0.036487092 -0.015638509
## md_q1i              0.194669499  2.891357e-02 -0.210925587 -0.077573658
## md_q1j              0.046316217  1.288044e-01 -0.010286094  0.085761970
## md_q1k             -0.233588519 -1.926662e-01  0.172746114  0.273791842
## md_q1l              0.024765307 -1.858152e-01  0.039486294  0.265833170
## md_q1m             -0.155035075  2.676809e-01 -0.306783859 -0.017530621
## md_q2               0.030476658 -1.200344e-01 -0.046194288 -0.095152168
## md_q3              -0.150642388  3.723367e-01 -0.196165357 -0.123396779
## md_total           -0.004459661  5.205068e-02 -0.041352876  0.001041501
## alcohol            -0.362609156  2.157563e-01 -0.061603604 -0.102738704
## thc                 0.019397114 -8.222242e-02 -0.251419520  0.180575433
## cocaine             0.114413785 -2.693442e-02 -0.106170776  0.068926781
## stimulants          0.154433646  5.056098e-01  0.558068390  0.094704649
## sedative_hypnotics -0.088145406 -4.956261e-03 -0.144510740  0.209193232
## opioids             0.276744596 -8.693465e-02  0.077018016 -0.265238048
## court_order         0.053788812  7.626777e-02 -0.296854189 -0.294052851
## education           0.013529385 -6.513594e-02 -0.048118528  0.032910360
## hx_of_violence      0.013529385 -6.513594e-02 -0.048118528  0.032910360
## disorderly_conduct  0.437089938 -2.092485e-01 -0.011507480 -0.094311994
## suicide            -0.372355970 -2.656045e-01  0.185347415 -0.272121979
## abuse              -0.080053240 -2.518829e-01 -0.003927053  0.351770988
## non_subst_dx        0.142774767 -1.547262e-02 -0.048952407  0.321002849
## subst_dx           -0.134550716 -5.005299e-05  0.036022214  0.010185238
##                             PC13         PC14         PC15        PC16
## age                -0.0566835833  0.163616334  0.012297629  0.24199815
## sex                -0.3149072797 -0.047271547  0.225311064 -0.08363222
## race                0.1481320472  0.082170128  0.015323069 -0.08418820
## md_q1a             -0.0307574353 -0.142836076  0.137302947 -0.19383557
## md_q1b             -0.0075661030 -0.090756403  0.055728793  0.02457904
## md_q1c              0.4587862022 -0.420417251 -0.115378732  0.11296587
## md_q1d              0.0232887585 -0.250551962  0.179881414 -0.08397800
## md_q1e             -0.1700571432 -0.280241398  0.077864771  0.26432469
## md_q1f             -0.2377936951  0.056933826 -0.068781708  0.24276227
## md_q1g              0.1113020936  0.347037567 -0.011928770 -0.06340376
## md_q1h             -0.2021262501  0.055227529 -0.411538837 -0.01267809
## md_q1i              0.1391623356  0.152084344 -0.400990623 -0.02541391
## md_q1j              0.0006926974  0.079675318  0.436293234 -0.09918934
## md_q1k             -0.1455080744  0.008480059  0.065864158 -0.01353261
## md_q1l              0.0208396651  0.267129995  0.045754453 -0.08946704
## md_q1m             -0.0864165864  0.112193589  0.036567748 -0.01924319
## md_q2               0.0100129311  0.097308663  0.029697232  0.15526761
## md_q3               0.2567400929 -0.128481931  0.005953032 -0.21375150
## md_total            0.0478185986 -0.028648309  0.009420751 -0.02407997
## alcohol            -0.3247676383 -0.104343186 -0.148056062 -0.02649561
## thc                -0.1087895821 -0.169723443 -0.205027261  0.08020182
## cocaine             0.1594818917  0.320854953  0.252912706 -0.09741821
## stimulants          0.2426042800  0.244189866 -0.118722054  0.20153971
## sedative_hypnotics  0.2309244363 -0.162910974  0.203122192  0.41381435
## opioids             0.0015252036  0.086552953 -0.121570123 -0.28590673
## court_order         0.1636974608  0.154698357  0.245582510  0.15316215
## education           0.0163118730  0.008955391  0.115138035 -0.04181500
## hx_of_violence      0.0163118730  0.008955391  0.115138035 -0.04181500
## disorderly_conduct -0.0702596164 -0.190800072  0.151453792  0.01470623
## suicide             0.1963560061  0.030606298 -0.110984797  0.15148028
## abuse               0.2053274927  0.033526867 -0.080407467  0.09662056
## non_subst_dx        0.1979817441 -0.226810587 -0.112313461 -0.41547120
## subst_dx            0.0329927791 -0.014852893  0.042269592 -0.32352038
##                             PC17        PC18          PC19         PC20
## age                -0.1870326699 -0.02365295 -0.0117718053 -0.077746561
## sex                -0.1462178029 -0.08763934 -0.2634067449  0.022750722
## race                0.0953467402  0.03638251 -0.2031385134  0.029784290
## md_q1a              0.0174326509 -0.01902714 -0.2180371086  0.449362965
## md_q1b             -0.3718839962  0.09825459 -0.0278371827 -0.023705035
## md_q1c              0.1194484634 -0.16435465 -0.0789192301 -0.213613049
## md_q1d              0.0619530246  0.36819155 -0.0099590179  0.003661264
## md_q1e              0.2614331687 -0.17702293  0.1789539877  0.003872197
## md_q1f              0.3003612487  0.22281552  0.2723682222 -0.099204693
## md_q1g              0.2109967516 -0.37334403  0.0683324997 -0.152417977
## md_q1h              0.0006187089  0.19273542 -0.0585774650  0.369583110
## md_q1i             -0.2992088240  0.02755310 -0.1363472552 -0.043488447
## md_q1j             -0.1472169243 -0.28207854 -0.0881950334 -0.246076181
## md_q1k             -0.0771208635  0.20794337  0.1232363649 -0.285598549
## md_q1l              0.0057437817  0.02069467 -0.2362904536 -0.010916597
## md_q1m              0.3973305215 -0.02120433 -0.2934970419  0.164172519
## md_q2              -0.1648016440 -0.13969418  0.2080801007  0.018659831
## md_q3              -0.2105938751  0.24904888  0.1271540043 -0.059873173
## md_total           -0.0197334612  0.06137333 -0.0006977298 -0.027775182
## alcohol            -0.1820015150 -0.07026246  0.0956119986 -0.144798299
## thc                -0.0236739359 -0.12382008  0.0165876036 -0.013157247
## cocaine            -0.1228473101  0.11275844  0.4526434824  0.316320096
## stimulants         -0.1045699017  0.05519618 -0.0528155014 -0.053845352
## sedative_hypnotics -0.0595682203  0.09175329  0.0875320511  0.202051868
## opioids             0.1754332562 -0.01200602  0.0329000806 -0.169082560
## court_order         0.2611983949  0.24423125 -0.0924090299  0.006698803
## education          -0.0510309202  0.04415637 -0.0586475426  0.072776513
## hx_of_violence     -0.0510309202  0.04415637 -0.0586475426  0.072776513
## disorderly_conduct -0.1783780631 -0.15117536 -0.1927103963  0.040641543
## suicide            -0.0382237128 -0.39632642  0.0465325545  0.322696844
## abuse               0.0588274512  0.23558516 -0.3264000560 -0.214625464
## non_subst_dx        0.0973527244 -0.07849497  0.2071535114  0.137387595
## subst_dx            0.1188237993 -0.02903089  0.2173885911 -0.155169110
##                            PC21         PC22          PC23        PC24
## age                -0.170897495  0.193573958  0.0255079652  0.31301604
## sex                 0.047421155  0.134956383  0.1010911002  0.08370083
## race                0.159887305 -0.141562557 -0.1879680691  0.48740095
## md_q1a              0.161547706  0.315140028 -0.2192332996 -0.17167110
## md_q1b             -0.288197625 -0.101481533 -0.2228550358 -0.16766404
## md_q1c             -0.044161287  0.324717761 -0.0880631348 -0.13563961
## md_q1d             -0.115488196 -0.250324317  0.4180957121 -0.03936866
## md_q1e             -0.064158483 -0.256973739 -0.2646152392  0.01724736
## md_q1f              0.463246339  0.298199187  0.0965312647  0.15877717
## md_q1g              0.091283278 -0.392803372 -0.1218131778 -0.05134410
## md_q1h             -0.051777496 -0.121814172 -0.1178974225 -0.02013894
## md_q1i              0.239243829 -0.143772590  0.1948224884 -0.05787820
## md_q1j              0.057730397  0.064083649  0.1811352578  0.01645527
## md_q1k             -0.253871846  0.135164775 -0.1083114379  0.19114450
## md_q1l              0.103771257  0.124228992  0.0913237629  0.10945543
## md_q1m             -0.356974559 -0.083875233  0.2649960846  0.07664469
## md_q2              -0.171850246  0.031716664  0.0712162935 -0.12558114
## md_q3               0.162531921  0.039002813 -0.1061998218  0.28323765
## md_total            0.005290795  0.019323755 -0.0002904846  0.04265021
## alcohol             0.101046295 -0.130950510 -0.1743377008 -0.05086443
## thc                -0.260448159  0.115118494  0.2559443322  0.26543853
## cocaine            -0.133665496  0.038646011 -0.0878104464 -0.03711447
## stimulants         -0.112060163 -0.005252219  0.0954619503  0.07688555
## sedative_hypnotics  0.129499260 -0.260596252  0.0809655188  0.20148241
## opioids            -0.256493743  0.120058085 -0.3469397936  0.24416249
## court_order        -0.046960983  0.043873170 -0.1999846191 -0.06160760
## education           0.027584211  0.001538790 -0.0669431107  0.00637302
## hx_of_violence      0.027584211  0.001538790 -0.0669431107  0.00637302
## disorderly_conduct  0.165207538 -0.272893665 -0.1472912389  0.27901502
## suicide             0.016529805  0.212829718  0.0659077659  0.10013141
## abuse              -0.014115202 -0.081850623 -0.1216962003 -0.19786384
## non_subst_dx       -0.112290928 -0.101798580  0.1509651126  0.23410361
## subst_dx            0.185699385 -0.032378270  0.1891561946 -0.17117134
##                           PC25         PC26         PC27         PC28
## age                -0.26785346 -0.002253169 -0.127366151 -0.309086965
## sex                 0.04916629  0.089355875 -0.264718977 -0.073307282
## race               -0.08937166 -0.224083967 -0.014000175  0.082232949
## md_q1a             -0.34031571 -0.095689106 -0.013450032  0.066429003
## md_q1b              0.18105630 -0.347834216  0.199691499 -0.443849874
## md_q1c              0.07882840 -0.110232027 -0.216744793 -0.082686496
## md_q1d             -0.15329029 -0.045832716  0.113508812  0.041260242
## md_q1e             -0.01914802  0.344548430  0.343917831 -0.046766762
## md_q1f              0.23394239 -0.318611945  0.047248561 -0.009746954
## md_q1g             -0.13657360 -0.035592007 -0.209064624 -0.165224788
## md_q1h              0.05390273 -0.062122306 -0.299977635 -0.128100563
## md_q1i             -0.08160362  0.149251088  0.257758807  0.058417783
## md_q1j              0.04203937 -0.275594820  0.180658726  0.134289661
## md_q1k              0.11576300  0.274000729 -0.244072892 -0.006456198
## md_q1l             -0.15274544  0.230878519  0.194873631 -0.176675641
## md_q1m              0.30781921 -0.046028033 -0.042117541  0.088223976
## md_q2              -0.21134622  0.056851394 -0.368644429  0.474590528
## md_q3               0.08944600  0.325697793 -0.008096621  0.083313370
## md_total            0.01980200  0.047655620  0.007179400  0.002639981
## alcohol            -0.15476069 -0.290611714  0.017444649  0.158125747
## thc                -0.26301823 -0.041918997  0.180623314  0.015019790
## cocaine             0.18639578 -0.071499579  0.098500391  0.136635927
## stimulants          0.09543156 -0.178072306 -0.063935200 -0.067730874
## sedative_hypnotics -0.24287796 -0.141420419 -0.110126323 -0.023261608
## opioids            -0.13656991 -0.137744089  0.199086321  0.175074811
## court_order        -0.23445061  0.070125881 -0.035078251 -0.164432097
## education           0.05753363  0.030861419  0.049668262 -0.048468733
## hx_of_violence      0.05753363  0.030861419  0.049668262 -0.048468733
## disorderly_conduct  0.33326963  0.043939163 -0.202050282  0.062014207
## suicide             0.25613273  0.138254753  0.164969828 -0.044223272
## abuse               0.12189697 -0.042149329  0.004592098  0.299047946
## non_subst_dx       -0.01478651 -0.183997902 -0.092480103 -0.071575151
## subst_dx           -0.04425541  0.078928034 -0.209057892 -0.363240849
##                            PC29         PC30         PC31          PC32
## age                 0.180608169  0.005244402 -0.191215927  0.0047882298
## sex                -0.202131934  0.049319791  0.327907429 -0.0010897764
## race                0.170331437 -0.136731709  0.166632077  0.0046861276
## md_q1a              0.214050074  0.316998687 -0.059936603 -0.0884555682
## md_q1b              0.172297408 -0.061020840  0.118270328 -0.0979929477
## md_q1c             -0.285169720  0.012657671  0.127109383 -0.1010277811
## md_q1d             -0.158208574  0.106247058 -0.042375739 -0.0917618750
## md_q1e              0.116353935 -0.155536570  0.071895756 -0.0915455394
## md_q1f              0.071828922  0.049059380  0.050287944 -0.0979440165
## md_q1g             -0.225746161  0.390188385 -0.141826074 -0.0767674339
## md_q1h             -0.163932462 -0.274730610 -0.380673251 -0.0795874460
## md_q1i              0.259893527  0.240654902  0.281301014 -0.1032690575
## md_q1j              0.050071871 -0.178171936 -0.393095996 -0.0931272576
## md_q1k              0.068913345  0.334696850  0.106622557 -0.1026426523
## md_q1l             -0.399868384 -0.410440999  0.224667155 -0.1007298829
## md_q1m              0.141926964  0.105292203  0.072805250 -0.0998480201
## md_q2               0.245649787 -0.319154375  0.160377612 -0.0913522530
## md_q3              -0.070802457 -0.011747432 -0.314380498 -0.2046759901
## md_total           -0.007658571  0.017569575 -0.032249499  0.9127971557
## alcohol            -0.265887230 -0.028656734  0.167401154  0.0019317750
## thc                -0.106577691  0.140404296 -0.212588417 -0.0012956326
## cocaine            -0.242046781  0.090716536  0.111358104 -0.0082806317
## stimulants          0.010480051  0.027641891  0.094178121 -0.0025172554
## sedative_hypnotics -0.058346759  0.074134262  0.046529907  0.0021420482
## opioids             0.036590202  0.010835635  0.024435561 -0.0037791273
## court_order         0.078890841 -0.120406802  0.090754495 -0.0007958411
## education          -0.015385660  0.091091406 -0.023997820  0.0004329570
## hx_of_violence     -0.015385660  0.091091406 -0.023997820  0.0004329570
## disorderly_conduct  0.076250065 -0.004390021 -0.006114958  0.0067759746
## suicide             0.044638381 -0.015072008 -0.057500849  0.0034026240
## abuse               0.063769436 -0.080075987 -0.173345676  0.0056517048
## non_subst_dx        0.069950480 -0.146471307  0.227337964  0.0055267665
## subst_dx            0.360430181 -0.176006698 -0.042213313  0.0073415984
##                             PC33
## age                 0.000000e+00
## sex                 4.446564e-17
## race               -1.000491e-16
## md_q1a              2.588352e-16
## md_q1b              3.507338e-16
## md_q1c              6.302357e-16
## md_q1d              5.178221e-16
## md_q1e              1.907546e-16
## md_q1f              1.233500e-16
## md_q1g              2.085367e-16
## md_q1h              2.778133e-16
## md_q1i              2.859931e-16
## md_q1j              4.103841e-16
## md_q1k              3.666871e-16
## md_q1l              2.954597e-16
## md_q1m              2.173715e-16
## md_q2               3.447506e-16
## md_q3               5.957333e-16
## md_total           -3.085676e-15
## alcohol             2.859500e-16
## thc                -2.062939e-17
## cocaine            -1.448070e-16
## stimulants         -5.207871e-17
## sedative_hypnotics  2.756697e-17
## opioids            -4.820730e-19
## court_order         1.379461e-16
## education           7.071068e-01
## hx_of_violence     -7.071068e-01
## disorderly_conduct -1.034042e-16
## suicide            -1.493044e-16
## abuse              -1.382459e-16
## non_subst_dx       -1.450787e-16
## subst_dx           -8.218822e-17
names(pca.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

Next, we will print the summary of the prcomp object and draw a biplot

# summary of the prcomp object
summary(pca.out)
## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6
## Standard deviation     2.6499 1.66758 1.51779 1.41044 1.3175 1.23405
## Proportion of Variance 0.2128 0.08427 0.06981 0.06028 0.0526 0.04615
## Cumulative Proportion  0.2128 0.29705 0.36686 0.42714 0.4797 0.52589
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.20332 1.10727 1.05350 1.01224 0.98184 0.95678
## Proportion of Variance 0.04388 0.03715 0.03363 0.03105 0.02921 0.02774
## Cumulative Proportion  0.56977 0.60692 0.64056 0.67160 0.70082 0.72856
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.91261 0.87905 0.86427 0.83893 0.81415 0.77188
## Proportion of Variance 0.02524 0.02342 0.02264 0.02133 0.02009 0.01805
## Cumulative Proportion  0.75380 0.77721 0.79985 0.82117 0.84126 0.85932
##                          PC19    PC20    PC21    PC22   PC23    PC24
## Standard deviation     0.7511 0.72022 0.69616 0.65429 0.6371 0.63217
## Proportion of Variance 0.0171 0.01572 0.01469 0.01297 0.0123 0.01211
## Cumulative Proportion  0.8764 0.89213 0.90682 0.91979 0.9321 0.94420
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.59015 0.58270 0.52244 0.50912 0.49183 0.46030
## Proportion of Variance 0.01055 0.01029 0.00827 0.00785 0.00733 0.00642
## Cumulative Proportion  0.95475 0.96504 0.97331 0.98117 0.98850 0.99492
##                           PC31    PC32      PC33
## Standard deviation     0.40766 0.03891 4.072e-16
## Proportion of Variance 0.00504 0.00005 0.000e+00
## Cumulative Proportion  0.99995 1.00000 1.000e+00

Here we get principal components named PC1-PC52. Each of these explains a percentage of the total variation in the dataset. For example, PC1 explains nearly 22% of the total variance i.e. around One-fourth of the information of the dataset can be encapsulated by just that one Principal Component. PC2 explains 8% and so on.

Next, Plotting PCA: While plotting a PCA we refer to a scatterplot of the first two principal components PC1 and PC2. These plots reveal the features of data such as non-linearity and departure from normality. PC1 and PC2 are evaluated for each sample vector and plotted.

# loading library
library(ggfortify)

#The autoplot( ) function of the ‘ggfortify package’ for plotting PCA:
pca.out.plot <- autoplot(pca.out,
                          data = adhd,
                          colour='suicide')
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
pca.out.plot

For a better understanding of the linear transformation of features, biplot( ) function also be used to plot PCA.

biplot.pca.out <- biplot(pca.out)

biplot.pca.out
## NULL

For determining the ideal features which can be justified after performing PCA, the plot( ) function can be used to plot the precomp object.

plot.pca.out <- plot(pca.out, type="l")

plot.pca.out
## NULL

In a screeplot the ‘arm-bend’ represents a decrease in cumulative contribution. The above plot shows the bend at the second principal component.

which question may have a heavy bearing on the score : For PC1, we can say that the Question Q1a is the most important (md_q1a -0.26892181) and for PC2, we can say that the Question Q3 is the most important (md_q3 -0.236710147) along with the question on Cocaine (cocaine 0.381655078).